MacKinlay, Andrew, Rebecca Dridan, Dan Flickinger and Timothy Baldwin (2010) Cross-Domain Effects on Parse Selection for Precision Grammars, Research on Language and Computation 8(4), pp. 299-340

Author

  • ANDREW MACKINLAY
Abstract

We examine the impact of domain on parse selection accuracy, in the context of precision HPSG parsing using the English Resource Grammar, using two training corpora and four test corpora and evaluating using exact tree matches as well as dependency F-scores. In addition to determining the relative impact of in- vs. cross-domain parse selection training on parser performance, we propose strategies to avoid the cross-domain performance penalty when limited in-domain data is available. Our work supports previous research showing that in-domain training data significantly improves parse selection accuracy, and that it provides greater parser accuracy than an out-of-domain training corpus of the same size, but we verify experimentally that this holds for a handcrafted grammar, observing a 10–16% improvement in exact match and a 5–6% improvement in dependency F-score by using a domain-matched training corpus. We also find that it is possible to considerably improve parse selection accuracy through the construction of even small-scale in-domain treebanks, and the learning of parse selection models over in-domain and out-of-domain data. Naively adding an 11000-token in-domain training corpus boosts dependency F-score by 2–3% over using solely out-of-domain data. We investigate more sophisticated strategies for combining data from these sources to train models: weighted linear interpolation between the single-domain models, and training a model from the combined data, optionally duplicating the smaller corpus to give it a higher weighting. The most successful strategy is training a monolithic model after duplicating the smaller corpus, which gives an improvement over a range of weightings, but we also show that the optimal value for these parameters can be estimated on a case-by-case basis using a cross-validation strategy.
This domain-tuning strategy provides a further performance improvement of up to 2.3% for exact match and 0.9% for dependency F-score compared to the naive combination strategy using the same data.
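The weighted linear interpolation strategy described in the abstract can be sketched as follows. This is an illustrative toy, not the authors' implementation: the two "models" here are mocked as simple score lookups (real HPSG parse selection models are log-linear rankers over tree features), and the function names, parse labels, and the weight `lam` are all hypothetical.

```python
def interpolate_scores(in_domain_score, out_domain_score, lam):
    """Interpolate two model scores, with weight lam on the in-domain model."""
    return lam * in_domain_score + (1.0 - lam) * out_domain_score


def rank_parses(parses, in_model, out_model, lam):
    """Rank candidate parses by interpolated score, best first."""
    return sorted(
        parses,
        key=lambda p: interpolate_scores(in_model(p), out_model(p), lam),
        reverse=True,
    )


# Toy example: two candidate parses scored by two hypothetical single-domain
# models that disagree about which parse is best.
in_model = {"parse_a": 0.2, "parse_b": 0.9}.get
out_model = {"parse_a": 0.8, "parse_b": 0.1}.get

# With lam = 0.7 the in-domain model dominates, so parse_b is ranked first.
best = rank_parses(["parse_a", "parse_b"], in_model, out_model, 0.7)[0]
print(best)  # parse_b
```

In the paper's setting, the interpolation weight (like the duplication factor in the monolithic-model strategy) is a free parameter; the abstract notes it can be tuned per case by cross-validation on the small in-domain treebank rather than fixed in advance.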


Related articles

Treeblazing: Using External Treebanks to Filter Parse Forests for Parse Selection and Treebanking

We describe “treeblazing”, a method of using annotations from the GENIA treebank to constrain a parse forest from an HPSG parser. Combining this with self-training, we show significant dependency score improvements in a task of adaptation to the biomedical domain, reducing error rate by 9% compared to out-of-domain gold data and 6% compared to self-training. We also demonstrate improvements in ...


Unsupervised Parse Selection for HPSG

Parser disambiguation with precision grammars generally takes place via statistical ranking of the parse yield of the grammar using a supervised parse selection model. In the standard process, the parse selection model is trained over a hand-disambiguated treebank, meaning that without a significant investment of effort to produce the treebank, parse selection is not possible. Furthermore, as t...


The Effects of Semantic Annotations on Precision Parse Ranking

We investigate the effects of adding semantic annotations including word sense hypernyms to the source text for use as an extra source of information in HPSG parse ranking for the English Resource Grammar. The semantic annotations are coarse semantic categories or entries from a distributional thesaurus, assigned either heuristically or by a pre-trained tagger. We test this using two test corpo...


Baldwin, Timothy (2007) Scalable Deep Linguistic Processing: Mind the Lexical Gap, In Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation (PACLIC21), Seoul, Korea, pp. 3-12

Coverage has been a constant thorn in the side of deployed deep linguistic processing applications, largely because of the difficulty in constructing, maintaining and domain-tuning the complex lexicons that they rely on. This paper reviews various strands of research on deep lexical acquisition (DLA), i.e. the (semi-)automatic creation of linguistically-rich language resources, particularly fro...


Letcher, Ned, Rebecca Dridan and Timothy Baldwin (2015) gDelta: A Missing Link in the Grammar Engineering Toolchain, Language Resources and Evaluation

The development of precision grammars is an inherently resource-intensive process; their complexity means that changes made to one area of a grammar often introduce unexpected flow-on effects elsewhere in the grammar, which may only be discovered after some time has been invested in updating numerous test suite items. In this paper, we present the browser-based gDelta tool, which aims to provide...



Publication date: 2011